01950763 SUPPLIER NUMBER: 18381388 (USE FORMAT 7 OR 9 FOR FULL TEXT)

The search ends: Reviewing reviewers. (Internet search sites) (Berst Mode) 
(Internet/Web/Online Service Information) (Column) 

Berst, Jesse 

PC Week, v13, n23, p59(1)
June 10, 1996 

DOCUMENT TYPE: Column ISSN: 0740-1604 LANGUAGE: English 

RECORD TYPE: Fulltext; Abstract 

WORD COUNT: 580 LINE COUNT: 00048

... Oracle and Microsoft. These engines understand which topics are
related to which others. They can find relevant documents even if the
keyword isn't on the page.

* Thesauri and topic searchers from companies such as Verity. These
programs automatically generate and search for related terms when you
enter a keyword. Inso's SearchWizard natural-language processor has
similar benefits.

* Offline researchers and search consolidators such as WebCrawler 
and Metacrawler, which go out to multiple search sites, then categorize 
and summarize what they found. 

* Collaborative filtering from companies such as Agents Inc. You 
tell a searcher about your preferences and it lists things that other 
people with similar preferences have recommended. 

The victor in this race will couple a massive AltaVista-style index... 
01837736 SUPPLIER NUMBER: 17443197 (USE FORMAT 7 OR 9 FOR FULL TEXT)

As the Internet grows, so do search options. (Tippecanoe Systems Tecumseh
and InText CP Software Group's Retrieval Engine Webserver SDK offer
sophisticated online information search and retrieval)

Nadile, Lisa 

PC Week, v12, n37, p67(2)
Sep 18, 1995 

ISSN: 0740-1604 LANGUAGE: English RECORD TYPE: Fulltext; Abstract 

WORD COUNT: 508 LINE COUNT: 00045 

Retrieval Engine Webserver SDK, lets users search using naturally
worded queries. The software, which offers relevancy ranking and
summarising features, includes unique HTML (Hypertext Markup Language)
authoring and hyperlink technology that lets users download an HTML
document with all links intact, officials from the San Francisco company
said.

By clicking on a link in a downloaded file, the software
automatically launches the user's browser software and connects the user to
the relevant Web site, officials said.

The InText Retrieval Engine, which runs on Windows 3.1 and Unix, is 
priced starting at $5,000. The product is available directly from InText. 

Architext Software's agent-based technology, slated to debut this
fall, will help users locate relevant information even if they are
unsure of what they're looking for or the right keywords, according to
officials from the Mountain View, Calif., company.

Architext's concept-based searching features will locate
documents with similar content and context and generate subject groups,
abstracts, and hypertext links automatically. The product's...
DIALOG (R) File 16: Gale Group PROMT (R)

(c) 2004 The Gale Group. All rts. reserv. 

04304609 Supplier Number: 46309387 (USE FORMAT 7 FOR FULLTEXT) 
GNN's WEBCRAWLER INTERNET SEARCH SERVICE UPGRADE OFFERS ENHANCED
TECHNOLOGICAL CAPABILITIES AND NEW USER INTERFACE 

PR Newswire, p416SFTU018 
April 16, 1996 

Language: English Record Type: Fulltext 
Document Type: Newswire; Trade 
Word Count: 1076

... reviewed Internet sites.

"Web surfers need ease-of-use and direction when using an Internet
search engine," said Ted Leonsis, president of America Online Services.
"WebCrawler's next-generation features and...

...of users exploring the Web while providing a more powerful and precise 
search engine," 

The WebCrawler upgrade allows users to control how they receive
search results. Users have the option to receive information by document
title or to include document summary information. The title option lists
search results in order of highest to lowest relevance. When summaries are
requested, a search also returns key phrases which best describe the
documents' content. Another search feature of WebCrawler is the
Similarity Search. Within the search results, Similarity Search,
often called "more like this," assists users in finding additional
relevant Web pages in their area of interest.

Significant functionality has been added to the WebCrawler service
including the integration of GNN Select. Formerly known as WIC Select, GNN
Select is...

...indexes and newsgroups as selected and reviewed by GNN's expert 
editorial staff. GNN Select, located on the front screen of the 
WebCrawler service, is divided into 14 distinct categories, from. . . 
02715774 (USE FORMAT 7 OR 9 FOR FULLTEXT)
Robot-generated databases on the World Wide Web

Kimmel, Stacey

Database (DTB), v19 n1, p40-49
Feb 1996 

ISSN: 0162-4105 JOURNAL CODE: DTB 

DOCUMENT TYPE: Feature 

LANGUAGE: English RECORD TYPE: Fulltext; Abstract 

WORD COUNT: 4074 LENGTH: Long (31+ col inches)

TEXT: 

data extraction scheme offers many of the advantages of full-text
indexing. Lycos' search and relevance ranking features also help to
account for this server's popularity. Lycos recently installed hardware and
software upgrades to enhance search speed and availability of this
service.

WebCrawler

http://webcrawler.com

Webcrawler was developed by Brian. . . 

...operated by America Online, Inc. WebCrawler's database contains
information on over 220,000 explored (retrieved and indexed) documents
and 3.6 million known but unexplored documents. While indexing, it can
build its database at a rate of 1,000 documents per hour. WebCrawler
includes HTTP, gopher, and FTP resources, and it indexes documents in
full text, excluding stopwords.

The WebCrawler search form (Figure 3) allows users to select a
Boolean connector (Any or All words) and set maximum retrieval to 10, 25,
or 100 documents. (Figure 3 omitted) A limited truncation feature accounts
for plural and singular forms of keywords ("s" and "es" are stripped).
Items retrieved are ranked in order of relevance; the results list
includes the document title and its relevance score, displayed as a
number normalized to 100. A View the Next (10, 25, 100) Results button lets
users browse results beyond the maximum retrieval specified. The term
"ebola" yielded 123 hits while "pollution" found 782 hits.

WebCrawler's easy... 
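
The search behaviour described in this record (Any/All connector, plural truncation, scores normalized to 100, a retrieval cap) can be sketched roughly as follows. This is only an illustration under stated assumptions: the article does not say how WebCrawler computed relevance, so raw term counts are used here, and the names `base_forms` and `search` are hypothetical.

```python
import re


def base_forms(word):
    """Candidate forms of a word with a trailing "s" or "es" stripped."""
    w = word.lower()
    forms = {w}
    if w.endswith("s"):
        forms.add(w[:-1])
    if w.endswith("es"):
        forms.add(w[:-2])
    return forms


def search(docs, query, mode="any", limit=10):
    """docs maps title -> text. Returns (title, score) pairs, best first.

    Scoring by raw match counts is an assumption made for this sketch.
    """
    terms = [base_forms(t) for t in query.split()]
    raw = {}
    for title, text in docs.items():
        words = [base_forms(w) for w in re.findall(r"[a-z0-9]+", text.lower())]
        # Two words match when their stripped forms overlap (page / pages).
        counts = [sum(1 for w in words if w & t) for t in terms]
        matched = all(counts) if mode == "all" else any(counts)
        if matched:
            raw[title] = sum(counts)
    if not raw:
        return []
    top = max(raw.values())
    ranked = sorted(raw.items(), key=lambda kv: (-kv[1], kv[0]))
    # Normalize so that the best document scores 100.
    return [(title, round(100 * score / top)) for title, score in ranked[:limit]]
```

Comparing sets of stripped forms, rather than stripping each word once, keeps pairs such as "page" and "pages" matching even though removing "es" from "pages" alone would give "pag".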
DIALOG (R) File 810: Business Wire

(c) 1999 Business Wire. All rts. reserv.

0910670 BW1142 

CQN INTELLISEEK: IntelliSeek Corrects and Replaces Sept. 21 Release; 
Removes Paragraph 

September 22, 1998 

Byline: Business/Technology Editors 

...Discovery Engine (RIDE) and the BullsEye Tracker. 

— BullsEye Manager is the central point for launching
searches. Users can save searches, analyze and refine results
off-line, organize the information reports generated...

...bookmarks, etc. With its
intuitive graphical interface, BullsEye Manager delivers single click
access to favorite searches and the search history.

Intelligent Search Agents - BullsEye features nine dynamic
search agents, accessing 300+ search engines: WebSearch, NewsFinder,
PeopleTalk, BookFinder, SoftwareFinder, BusinessFinder, CollegeFinder,
FAQFinder and HealthAnswers Agent. All agents offer an intuitive
interface customized to the search context including a query builder
and numerous query assistants (Thesaurus, Spell Checker, Sounds-like,
etc.). BullsEye's architecture makes it easy for IntelliSeek to
continue adding agents over time.

— Rapid Information Discovery Engine (RIDE) - RIDE offers a
set of services that underlies the operation of all Search Agents
bringing state of the art information processing and linguistic
analysis technologies under one roof to...

...include automatic
document summarization, live highlighting and active linking of query
keywords in the retrieved documents.

— Information Tracker — Available with BullsEye Pro only, the
Tracker uncovers new or changed information relevant...
0388675 BW040

CAERE: Caere introduces PageKeeper Portfolio for integrated document
input and management

March 1, 1994 

Byline: Business Editors & Computer Writers 

...files
and scanned or faxed documents readily accessible to individual users.
The key features of PageKeeper Portfolio include the following:

— Automatic Index: PageKeeper Portfolio automatically indexes
information without requiring any manual...

...easily integrate paper documents and
electronic faxes into a database of documents.

— Intelligent Search and Retrieval: PageKeeper Portfolio uses two
new search technologies not found in other document management software
packages: Weighted Relevance Retrieval and Document Agent Search.
With Weighted Relevance Retrieval, Portfolio searches a database with
key words and then presents those documents in order of their
relevance to the query. Document Agent Search allows users to use one
document like an agent to find other documents with similar or related
information.

-- Data Compression: Using Caere's SuperCompression technology,
PageKeeper Portfolio saves text and images at up to a 50:1 ratio over
their uncompressed...

...This feature allows users to store
graphics-intensive images as well as large text-based documents.

Visual User Interface: PageKeeper Portfolio has an intuitive 
graphical interface with an easy-to-use... 
English Abstract

An industry database (18) and method of creating same is provided. The 
database (18) is created in accordance with a process that includes: 
identifying a plurality of web sites (12) meeting at least one search 
criteria; automatically extracting URL addresses for each of the 
plurality of web sites; automatically categorizing each of the web sites 
and their corresponding URL addresses in accordance with a predefined 
category structure; and automatically indexing and storing each of the 
URL addresses in accordance with the predefined category structure in the 
database (18). A method of using a database system is also provided. The 
method includes: storing in a database (18), information extracted from a 
plurality of web sites (12), wherein the information is automatically 
categorized and indexed in accordance with a predefined category 
structure and includes a plurality of URL addresses corresponding to the 
plurality of web sites; receiving a user query (14); executing a search 
engine in response to the user query (14) that searches a subset of the 
stored information extracted from a subset of the plurality of web sites, 
and subsequently searching said subset of web sites to find additional 
information responsive to said user query (14). 

French Abstract 

L'invention concerne une base de donnees (18) industrielle et un procede
permettant de creer cette base de donnees. Cette base de donnees (18) est
creee conformement a un processus qui consiste : a identifier une
pluralite de sites Web (12) qui satisfont au moins un critere de
recherche ; a extraire automatiquement des adresses URL pour chacun des
sites Web ; a categoriser automatiquement chacun des sites Web et leurs
adresses URL correspondantes conformement a une structure de categorie
predefinie ; et a indexer et stocker automatiquement chacune des adresses
URL conformement a la structure de categorie predefinie dans la base de
donnees (18). L'invention concerne egalement un procede d'utilisation
d'un systeme de base de donnees. Ce procede consiste : a stocker dans une
base de donnees (18), des informations extraites d'une pluralite de sites
Web (12), lesquelles informations sont automatiquement categorisees et
indexees conformement a une structure de categorie predefinie et
comprennent une pluralite d'adresses URL correspondant a la pluralite de
sites Web ; a recevoir une demande d'utilisateur (14) ; pour repondre a
cette demande d'utilisateur, a executer un moteur de recherche qui
explore un sous-ensemble des informations stockees extraites d'un
sous-ensemble de la pluralite de sites Web, puis a explorer ledit
ensemble de sites Web afin d'y trouver des informations supplementaires
qui repondent a la demande d'utilisateur (14).

Legal Status (Type, Date, Text) 

Publication 20021227 A1 With international search report.

Publication 20021227 A1 Before the expiration of the time limit for
amending the claims and to be republished in the
event of the receipt of amendments.

Examination 20030918 Request for preliminary examination prior to end of
19th month from priority date

International Patent Class: G06F-007/00 ... 

Fulltext Availability: 
Detailed Description 

Detailed Description 

below the threshold value but still exceeds the minimum preset limit,
the entry and all relevant pages are submitted to the administrator
for review. Additionally, in one embodiment, changes reflecting
particular types of events (e.g., new hires, new products, etc.) may be
monitored using key word search techniques so as to alert
administrators of particular changes of interest. When such changes are
detected, all relevant pages are submitted to the administrator for
review.

[0037] Similarly, in one embodiment, company news pages are
periodically scanned by the BioNews Engine for structure-changing
messages, for example, like those describing merger or acquisition,
strategic alliance etc. A set of keywords is defined for each such event
and is matched periodically (e.g., daily, once a week, etc.). Any other
types of events may also be searched using appropriate key words. Any
potentially relevant entries are extracted and corresponding news web
pages and/or company names are submitted to an
...etc.) are scanned for company names present in the InfoBase
database. The processing philosophy is similar to processing of company
news pages discussed above.

[0039] In addition to the proactive auto... maximum number of companies in
the biotechnology field. In one embodiment, the engines can be similar
to search engines from publicly available software such as google.com.

[0074] The BioNews search engine provides the latest company news. In a
preferred embodiment, a search is performed on domains (e.g., web
sites) defined by keywords relevant for the news pages - "news", "news
story", "news report" etc. In one embodiment, a human administrator purges
the resulting list to make sure that it contains links only to head news
pages. Alternatively or additionally, a human administrator can perform
domain definition manually, determining news page URL addresses for
each relevant company having a web site listed in the...

...with information pertaining to potential opportunities in the industry.
In one embodiment, the Opportunity Engine searches pre-selected resources
for relevant information. Such resources may include, for example,
specific pages of university web sites, government research ...
Technologies Ltd., Elan Corporation PLC, Ethypharm, etc.

[0076] In a preferred embodiment, information is retrieved and
updated from these pre-selected web pages in accordance with the methods
discussed above. Additionally, the retrieved information may be
automatically classified, indexed and stored in the InfoBase in a similar
fashion to the techniques discussed above.

[0077] In one embodiment, the Opportunity Engine searches indexed web
pages having URLs and corresponding content stored in the InfoBase,
when such web pages satisfy user criteria (e.g., all web pages
associated with diagnostic companies). As described above, potentially
relevant pages may be identified using key word and/or class
field searches (e.g., "licens* and diagnostic") entered by a
member/user. Opportunity information/content stored in the InfoBase may
be updated in a similar fashion to the techniques described above for
updating BioField and BioNews information.
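
A key word search such as "licens* and diagnostic" can be illustrated with a short sketch. The interpretation is an assumption, since the description does not define the query syntax: "*" is treated here as right-truncation, "and" as requiring every term to occur, and the function names `compile_query` and `matching_pages` are hypothetical.

```python
import re


def compile_query(query):
    """Turn a query like 'licens* and diagnostic' into regexes that must all match."""
    patterns = []
    for term in re.split(r"\s+and\s+", query.strip(), flags=re.IGNORECASE):
        # Assumed semantics: '*' stands for any run of word characters.
        body = re.escape(term).replace(r"\*", r"\w*")
        patterns.append(re.compile(rf"\b{body}\b", re.IGNORECASE))
    return patterns


def matching_pages(pages, query):
    """pages maps URL -> stored page text; returns URLs satisfying every term."""
    patterns = compile_query(query)
    return [url for url, text in pages.items()
            if all(p.search(text) for p in patterns)]
```

Under these assumptions "licens*" matches "license", "licensed" and "licensing", while "diagnostic" without a wildcard matches only that exact word.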



[0078] In...
Claims

Fulltext Word Count: 19700 
English Abstract 

A search engine utilizes a bottom-up approach to index the content of a 
network with agent programs running on each source computer instead of 
relying on a top-down approach as used by conventional search engines. 
The network being indexed may be any network, including the global 
computer network or an intranet. Instead of using a central site 
including spidering software to recursively search all linked web pages 
and generate a search index of the Internet, independent distributed 
components or agent programs are located at each web site and report meta 
data about objects at the web site to the central server. A central 
catalog of object references is compiled on the central site or sites 
from the meta data reported from each web site. One or more brochure 
files may also be created and stored on each web site to provide 
conceptual or non-key-word data about the site, such as target
demographics and categorization information. This conceptual information 
is then utilized in constructing the central catalog so that more 
accurate search results may be generated for search queries applied to 
the catalog. 
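
The bottom-up arrangement in this abstract, where each site's agent reports meta data and a central site compiles the catalog, might be sketched as below. Every name here (`SiteAgent`, `CentralCatalog`, the `category` brochure key, the report layout) is hypothetical and chosen only for illustration; the abstract does not specify data formats.

```python
from dataclasses import dataclass, field


@dataclass
class SiteAgent:
    """Hypothetical agent that runs on one web site and summarizes its pages."""
    site: str
    pages: dict                                   # url -> page text
    brochure: dict = field(default_factory=dict)  # conceptual data, e.g. category

    def report(self) -> dict:
        # Report meta data only (distinct words per object), not the pages.
        objects = [
            {"url": url, "words": sorted(set(text.lower().split()))}
            for url, text in self.pages.items()
        ]
        return {"site": self.site, "brochure": self.brochure, "objects": objects}


class CentralCatalog:
    """Hypothetical central catalog compiled from agent reports."""

    def __init__(self):
        self.index = {}      # word -> set of urls
        self.site_of = {}    # url -> site
        self.brochures = {}  # site -> brochure dict

    def ingest(self, report: dict) -> None:
        self.brochures[report["site"]] = report["brochure"]
        for obj in report["objects"]:
            self.site_of[obj["url"]] = report["site"]
            for word in obj["words"]:
                self.index.setdefault(word, set()).add(obj["url"])

    def search(self, word: str, category=None) -> list:
        urls = self.index.get(word.lower(), set())
        if category is not None:
            # Brochure data narrows results beyond plain key-word matching.
            urls = {u for u in urls
                    if self.brochures[self.site_of[u]].get("category") == category}
        return sorted(urls)
```

The central site never fetches pages in this sketch: it only merges what the agents send, which is the contrast with spidering that the abstract draws.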

French Abstract 

Selon l'invention, un moteur de recherche met en oeuvre une approche
base/sommet pour indexer le contenu d'un reseau, au moyen de programmes
comprenant des agents et fonctionnant sur chaque ordinateur source,
plutot que de mettre en oeuvre une approche sommet/base, comme c'est le
cas pour les moteurs de recherche classiques. Le reseau a indexer peut
etre n'importe quel reseau, notamment le reseau informatique mondial ou
un Intranet. Au lieu d'utiliser un site central comprenant un logiciel de
balayage du Web, pour rechercher de maniere recursive toutes les pages du
Web liees et produire un index de recherche de l'Internet, des composants
repartis, independants, ou programmes comprenant des agents, sont situes
au niveau de chaque site Web et signalent au serveur central les
metadonnees relatives a des objets situes au niveau du site Web. Un
catalogue central de references d'objets est compile sur le ou les sites
centraux, a partir des metadonnees signalees depuis chaque site Web. Un
ou plusieurs fichiers de brochures peuvent egalement etre crees et
stockes sur chaque site Web, afin de constituer des donnees conceptuelles
ou de mots non cles relatifs au site, telles que des informations
demographiques cibles et des informations de categorisation. Ces
informations conceptuelles sont utilisees ensuite dans la construction du
catalogue central, de facon que des resultats de recherche plus precis
puissent etre produits pour des demandes de recherches effectuees aupres
du catalogue.
Detailed Description

The indexer also can retrieve the contents of an html page to extract
relevant document information and index the document so that
subsequent search queries may be applied on indexed...

...allows visitors to the web server to apply search queries, and returns a
list of documents ranked by confidence in response to the search
queries. Since the program resides on the web ... brochure

17. Link URL Link Table 

18. Html tag information 

19. XML tag information 

Ranking 

Table 5- Agent Created Products Catalog 

1. iii. Type of product 

2. Category three letters representing General, Specific... 
. . . URL, 

6. Unique Record Identifier 

7. iv. Product Number 

8. V. Product price 

9. A. Feature or option 

10. Feature or option 

11. Feature or option 

12. Link URL Link Table 

Table 6- Agent Created Articles & Documents Table 
1. vii. Type of Articles or Documents



2. Category - three letters representing General, Specific, and 
Special Interest Categories 
. . .1, 2, 3, 4, 5, 6, 7, 8, 9 & 10 

4. Subject of Articles or Documents

5. Site URL, 

6. Unique Record Identifier 

7. viii. Date 

8. ix. Author 

9. X. Source of Articles or Documents 
10. 

11. 


. Link URL Link Table 

Table 7-Agent Created MP3 Table Fields 

I... to classify documents as they are found, and to assign concepts and 
the concept of relevancy strength to each document during parsing. 
The agent would thereafter store these concepts as standard
name/value extraction, where the unique statistical and logical
characteristics of image files processed by the agent are determined
and forwarded to a central site for later...

...understood by those skilled in the art.

During operation, the agent can parse local image files to extract
"features" contained within the images. For example, a file containing a
picture of a face can ... an example of this type of AI analysis of local
files by the agent.

The agent may also be used to determine the relative importance of a
document as a source or reference of information stored in linked
documents. As an example of adult site detection, the agent might use
a database consisting of a list of the... Companies have developed search
engine technologies that search based upon pattern matching and content
weighting techniques. For example, IBM has developed Query By Image
Content ("QBIC") and a system known...

...below. The QBIC and CLEVER systems would be capable of using data
produced by the agent for image, audio, and link information. The QBIC
system uses a pattern-matching engine embedded into an IBM DB2 database
system to compare image characteristics against a sample image. The
results of such comparisons are then retrievable via a Structured Query
Language ("SQL") statement. The QBIC system is intended for use in a
keyword environment, where a keyword search produces an initial set
of images which are then used as comparison templates and compared
against the pattern-matching engine. The CLEVER system determines
information source documents or "hubs" from URLs collected from one or
more web sites. This is similar in
concept to the methods described this year in a Scientific American
article, but the CLEVER system is actually running. A source document
is one that is referenced by many web pages or URLs, sometimes several
levels removed from the document itself. A hub is defined as a page
containing a series of links to other sites or source documents, and is
often referred to as a "links" page.
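
The distinction between source documents and hubs can be illustrated with mutually reinforcing link scores. The sketch below follows the general hub/authority iteration associated with Kleinberg's HITS algorithm; the text does not describe CLEVER's actual computation, so this is an assumed stand-in, and "authority" is used here for what the passage calls a source document.

```python
def hubs_and_authorities(links, iterations=20):
    """links maps each page to the list of pages it links to.

    Returns (hub, authority) score dictionaries. A simplified iteration
    used for illustration, not the CLEVER system's implementation.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority is the summed hub score of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # A page's hub score is the summed authority of the pages it links to.
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        # Normalize so the scores stay bounded across iterations.
        for scores in (auth, hub):
            norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth
```

A "links" page that points at several well-referenced documents ends up with the highest hub score, and the document referenced by the most pages ends up with the highest authority score.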

In both the QBIC and CLEVER systems, a source index or collection of
information is... applied to any system which requires transformation of
source data into a series of data points. A sound file, for example,
can be represented either as the time-series data (the actual digitized
sound...
English Abstract

A method and system are disclosed for organizing and retrieving 
information through the use of taxonomies. Documents stored in the 
organization and retrieval subsystem may be manually or automatically 
classified into a predetermined number of taxonomies. In operation, 
automatic term extractor creates a list of terms that are indicative of 
the subject matter contained in the documents. A term analysis system 
assigns the relevant terms to one or more taxonomies, and a suitable 
algorithm is then used to determine the relatedness between each list of 
terms and its associated taxonomy. The system then clusters documents for 
each taxonomy in accordance with the weights ascribed to the terms in the 
taxonomy's list and a directed acyclic graph (DAG) hierarchical structure 
is created. The present invention may then be used to aid a researcher or 
user in quickly identifying relevant documents, in response to an
inputted query. 



French Abstract 

L'invention concerne un procede et un systeme qui utilisent des
taxonomies pour organiser et extraire des informations. Des documents
stockes dans les sous-systemes d'organisation et d'extraction peuvent
etre classes manuellement ou automatiquement en un nombre preetabli de
taxonomies. Pendant le fonctionnement, un extracteur terminologique
automatique etablit une liste de termes indiquant la matiere traitee dans
les documents. Un systeme d'analyse terminologique attribue les termes
pertinents a une ou plusieurs taxonomies, et un algorithme approprie est
ensuite utilise pour determiner le rapprochement entre chaque liste de
termes et la taxonomie qui lui est associee. Le systeme regroupe ensuite
les documents pour chaque taxonomie, conformement aux poids attribues aux
termes figurant dans la liste de la taxonomie, aux fins de creer un
graphe acyclique oriente (DAG) ou une structure hierarchique. Le procede
de l'invention peut ainsi etre utilise pour aider un chercheur ou un
utilisateur a identifier rapidement des documents pertinents en reponse a
une demande entree.
English Abstract

...present invention may then be used to aid a researcher or user in
quickly identifying relevant documents, in response to an inputted
query...

Claim

present invention may then be used to aid a researcher or user in
quickly identifying relevant documents, in response to an inputted
query. It may be appreciated that both a document's...

...relevant response than a content-based retrieval, which is driven by the
actual words in the document. Additional features and advantages of
the invention will be set forth in the ...the word "truck" was in the
query. Documents tagged to "Trucks" are likely to be relevant.
Documents tagged to the child concept node "Pick-up" may or may not be
relevant, but ... control the process of taxonomy tag identification
using the text classifiers. These include threshold scores for tagging
either document-knowledge containers or question-knowledge containers,
and maximum numbers of tags to assign from each topic... may also consider
sentence boundaries, section boundaries or page boundaries are considered
as possible slice points. In general, a document should be sliced at
points where there is a fairly substantial and permanent shift in...

...and paragraphs N-2 and N-3, etc, up to some window size W; and
similarly between paragraph N and N+1; and between N-1 and N+1, N

...and slices 820. As shown in FIG. 8, the slicing algorithm has split
example document into 6 similarly-sized slices 820 a-f. Each slice
820 contains 1-3 paragraphs 720, and 2... <tag taxo=Industry tagid=fg1
weight=1.0 attribution=human>Federal Government</tag>

<tag taxo=Document-Source tagid=reut1 weight=1.0
attribution=human>Reuters</tag>

</taxonomy-tags>

</context> 

<content> 

IRS Reform Bill Passes 

Dateline...be identified by the text classifiers might be:
Government Agencies: Federal: Legislative: Congress with (estimated)
weight = 0.65

Government Agencies: Federal: Executive: IRS with (estimated) weight
= 0.75; and

Government Issue: Legislation: New Legislation with (estimated)
weight = 0.50

Each of these three tags have associated terminology that evidences the 
presence of the. . . 

...time>09:36:00</time>
</submission-time>
<taxonomy-tags>

<tag taxo=Industry tagid=fg1 weight=1.0 attribution=human>Federal
Government</tag>

<tag taxo=Document-Source tagid=reut1 weight=1.0
attribution=human>Reuters</tag>

<tag taxo=Government-Agencies tagid=con1 weight=0.65
attribution=machine>Congress</tag>

<tag taxo=Government-Agencies tagid=irs1 weight=0.75
attribution=machine>IRS</tag>

<tag taxo=Government-Issues tagid=nl1 weight=0.50
attribution=machine>New Legislation</tag>

</taxonomy-tags> 

</context> 

<content> 

<title><evid value=high . . . 

...year><time>09:36:00</time>
</submission-time>
<taxonomy-tags>

<tag taxo=Industry tagid=fg1 weight=1.0 attribution=human>Federal
Government</tag>

<tag taxo=Document-Source tagid=reut1 weight=1.0
attribution=human>Reuters</tag>

<tag taxo=Government-Agencies tagid=con1 weight=0.65
attribution=machine>Congress</tag>

<tag taxo=Government-Agencies tagid=irs1 weight=0.75
attribution=machine>IRS</tag>

<tag taxo=Government-Issues tagid=nl1 weight=0.50
attribution=machine>New Legislation</tag>
<tag taxo=Government-Officials tagid=lott1
attribution=lexical...

...Government Agencies" taxonomy as a topic taxonomy rather than a lexical
one. Therefore, tagging this document to, e.g., "IRS" was done using a
text-classifier over the entire text to...

...directly to the concept-node "IRS". The topic taxonomy for Government
Agencies indicates that the document concerns the tagged agencies; a
lexical taxonomy that both can be useful for retrieving documents.

The next step in the process involves using symbolic rules and reasoning
in order to refine the set of tags applied to the document. For
example, the output of this process may be the determination that another
concept node that might be relevant to
our example content is:

Government Issues : Legislation : Tax Legislation 

A knowledge-based transformation that might infer the relevance of
this concept node
is:

If content is tagged to Government Agencies : Federal : Executive : IRS
with weight above 0.60 and content is tagged to any node under
Government Agencies : Government Issues : Legislation with weight X
where X is greater than 0.35, add tag Government
Issues:Legislation:Tax Legislation to the content with weight X.
Finally, the system stores the results as a knowledge container in its
data store. If the document had been longer, the system could
optionally invoke slicing to break the document into multiple,
contiguous sections with different topics assigned to each section. In
this case, however...
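
The knowledge-based transformation quoted above can be expressed as a small rule function. This is a sketch under assumptions: tags are held as a dictionary from concept-node path to weight, the legislation branch is read as "Government Issues:Legislation" (the printed path appears garbled), and when several nodes qualify the largest weight is used as X, a choice the text does not specify. The constant and function names are hypothetical.

```python
IRS = "Government Agencies:Federal:Executive:IRS"
LEGISLATION = "Government Issues:Legislation"
TAX_LEGISLATION = "Government Issues:Legislation:Tax Legislation"


def apply_tax_rule(tags):
    """tags maps concept-node path -> weight. Returns a refined copy."""
    result = dict(tags)
    # Condition 1: tagged to IRS with weight above 0.60.
    if tags.get(IRS, 0.0) <= 0.60:
        return result
    # Condition 2: tagged to some node under Legislation with weight X > 0.35.
    qualifying = [
        weight for node, weight in tags.items()
        if node.startswith(LEGISLATION + ":")
        and node != TAX_LEGISLATION
        and weight > 0.35
    ]
    if qualifying:
        x = max(qualifying)  # assumption: take the strongest qualifying weight
        result[TAX_LEGISLATION] = max(x, result.get(TAX_LEGISLATION, 0.0))
    return result
```

With the example weights from the excerpt (IRS at 0.75, New Legislation at 0.50), the rule adds the Tax Legislation tag with weight 0.50.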

...this description will address a process for creating a knowledge map
from a collection of documents. As explained above, taxonomies, and by
extension knowledge maps, may be manually constructed based on...

...quality of operation. The input into the knowledge map generation
mechanism is a set of documents and a set of "target" taxonomy root
nodes. The output is a knowledge map. A...

...starting point for knowledge map generation, as shown in FIG. 9, is the
collection of documents that will be managed by the e-Service Portal
(step 902). This collection will...

...to be placed in knowledge containers. In one embodiment, the generation
corpus has the following characteristics: (1) the documents in the
corpus are a statistically valid sample of the documents to be managed;
(2) there are at least 1,000 and less than 30,000 documents; (3) there
are at least the equivalent of 500 pages of text and no more than
50,000 pages of text; and (4) the documents are decomposable into
ASCII text. The knowledge map generation process described below is
language independent. That is, so long as the documents can be
converted into electronic text, the process is also independent of
document format and type. The second input into the process (step 904)
is a set of...

..a valid input. First, the concept-nodes do not overlap. Second, the 
concept-nodes are relevant . 15 Third, the concept-nodes are orthogonal. 
The purpose of each root concept-node is... 

..in the knowledge map. Overlap occurs when two root nodes are provided 
that are actually identical or nearly 20 identical . In effect, the 
root concept-nodes are synonyms, and taxonomies generated from them would 
cover substantially the same portion and aspect of the knowledge 
domain. For example, the root nodes "Geography - The World. . . 

..are ascribed to a particular root, then that root concept-node is
probably not relevant. The cure is to eliminate the concept-node
from the input set and to re . . .

..is to have one taxonomy for each orthogonal view of knowledge within the
corpus.

Each document may have one or more taxonomy tags into each taxonomy. In
an orthogonal knowledge map... of knowledge map generation. If there is
little or no cross-tagging between two taxonomies (documents tagged to
one taxonomy are not tagged to another taxonomy), then non-orthogonality
can be . . .

.level concept node and to re-initiate the knowledge map generation
mechanism. Assuming valid inputs (documents and root concept-node set),
the invention will produce a valid output. As stated earlier...

.node in the input set. As shown in FIG. 9, the first step (904) is
document collection. The generation corpus is a representative sample of
documents from a single coherent knowledge domain, the representation
of which meets the needs of a specific business problem or domain. In one
typical scenario, an enterprise has a corpus of documents over which
they would like to provide the retrieval and display capabilities
described earlier in this specification. In that case, the generation
corpus would be a subset of the enterprise's corpus of documents. The
subset may be manually identified. In another scenario, the knowledge
domain is well-defined. . .

.the enterprise does not yet have a corpus covering the domain. In this
case, representative documents must be found and accumulated to form
the generation corpus. If the available corpus is...

.the maximum size prescribed above, sampling procedures may be employed
to choose a subset of documents for use in the generation corpus. As
shown in step 906, the next step is to convert the documents into XML
marked text as described above in the portion of the document that
addressed autocontextualization. Next, in step 908, the system performs
root concept-node collection and. . .

.common to all root concept-nodes within the knowledge map). In a
preferred embodiment, a file is prepared designating the set of root
concept-nodes. This file is provided as an input to knowledge map
generation and includes one record (with all...

.910, the system identifies and inputs the generation corpus. In one
embodiment, a file listing each individual document in the
generation corpus and its physical location, one per line, is provided
as an input to knowledge map generation. In step 912, term extraction is
then performed. Using any valid algorithm for term feature extraction,
a list of corpus terms is generated. The term list is ordered by
frequency or weight. This term list includes all indicators of meaning
within the generation corpus. The term list is a function of the
generation corpus documents - the text of these documents is read and
parsed to produce the list. A term may have any (or none) of the
following characteristics in any combination: a term may be
case-sensitive (the term "jaguar" is distinct from. . .

.the knowledge domain associated with the generation corpus. The SME
designates whether the term is relevant to each of the taxonomies in
the input set. Each term may be relevant in zero to N taxonomies where
N is the number of root concept
nodes. For example, the term "jaguar" may be relevant to the taxonomy on
"Mammals" and the taxonomy on "Automobiles". The result of this step...

.each root concept node. The terms extracted in step 912 are
automatically provisionally designated as relevant to zero or more
taxonomies according to their similarity to the SME-generated term
sets, using any word-similarity measures or algorithms from the fields
of computational linguistics and information retrieval. These
designations are presented to the SME for validation. Next, in step 916,
the system. . .

.every other taxonomy; and (3) a list of terms assigned to each taxonomy 
ordered by weight or frequency. Processing then flows to step 920, 
where the system performs diagnosis for irrelevant . . . 

.system determines whether any taxonomy is assigned a small number or
percentage of the term/features. If there are taxonomies that are
assigned to a small number of terms/features, processing flows to step
924 and the concept node is removed from the input list ... determines that
there is not overlap or nonorthogonality, processing flows to step 934,
where term weighting is performed. Using any standard algorithm for
weighting a list of features in terms of relative importance, the term
list for each taxonomy is weighted. Terms have a unique weight in
relationship to each taxonomy to which they are ascribed. So, the term
"jaguar" may have a low weight in relationship to the "Mammal" taxonomy
and a high weight in relationship to the "Automobile" taxonomy and a
zero weight (non-ascribed) in relationship to a "Geography"
taxonomy. Optionally, the system may, in step 936, subject the term
weights generated in step 934 to review by an SME. The SME may then
enter a new weight, replacing the computer-generated weight. One
weighting algorithm has the following key characteristics:

1. Terms with a high weight in one taxonomy have suppressed weights in all
other ... based on vocabulary usage. That is, words occurring in a query
that appear with the same frequency in every document contribute
nothing to the rank of any document. At the other end of the
spectrum, a query word that appears in only one document, and occurs many
times in that document, greatly increases the rank of that document.
Ranking takes into account the occurrences of a word both in the
document being ranked and in the collection at large; to be
precise, in the indexed collection. To be . . .

.of words that a search engine takes into account. The mathematical 
expression commonly associated with ranking is: 

Document Rank = Tf/df
where, Tf = number of times a term occurs in a document
       df = document ...
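
The ratio above can be illustrated with a short sketch. The definition of df is truncated in the excerpt, so the code assumes the conventional reading (the number of documents in the indexed collection that contain the term). The function name, whitespace tokenization, and summing the ratio over query terms are likewise assumptions made for illustration.

```python
from collections import Counter

def document_rank(query_terms, documents):
    """Score each document as the sum of Tf/df over the query terms.

    Assumptions: df is the number of documents containing the term,
    tokens are lower-cased whitespace-separated words, and query terms
    are already lower-case.
    """
    tokenized = [doc.lower().split() for doc in documents]
    # df: in how many documents of the collection each term occurs
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    ranks = []
    for tokens in tokenized:
        tf = Counter(tokens)  # Tf: occurrences of each term in this document
        ranks.append(sum(tf[t] / df[t] for t in query_terms if df[t] > 0))
    return ranks
```

Under this reading, a query word that occurs equally often in every document adds the same amount to every score and so cannot change the ordering, while a word concentrated in a single document raises that document's score sharply.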
[FIG. 18: search engine ranking figure; a color legend (green, blue) and reference numeral 1510 survive, remaining labels are not recoverable]

For each document, the rank returned by the search engine is adjusted by . . .
For purposes of illustration, rank is shown... and, where practical, search terms used)
EPO-Internal, WPI Data, PAJ, INSPEC, IBM-TDB 
C. DOCUMENTS CONSIDERED TO BE RELEVANT 

Category Citation of document, with indication, where appropriate, of
the relevant passages Relevant to...

.not cited to understand the principle or theory underlying the
considered to be of particular relevance invention

"E" earlier document but published on or after the international "X"
document of particular relevance; the claimed invention
filing date cannot be considered novel or cannot be considered to
. . .

..is taken alone which is cited to establish the publication date of
another "Y" document of particular relevance; the claimed invention
citation or other special reason (as specified) cannot be considered to
involve . . .

..the international filing date but in the art. later than the priority
date claimed "&" document member of the same patent family Date of
the actual completion of the international search Date of mailing of . . .

. . 1 of 3 

INTERNATIONAL SEARCH REPORT International Application No
PCT/US 00/16444

C. (Continuation) DOCUMENTS CONSIDERED TO BE RELEVANT 

Category Citation of document , with indication, where appropriate, of 
the relevant passages Relevant to claim No. 
X WONG J. . .2 of 3 

A CHAKRABARTI S. . . 



Set      Items   Description
S1        9991   (INDEX OR THESAURUS OR KEY)()(WORD? OR TERM? OR PHRASE?) OR
                 KEYWORD? OR KEYTERM?
S2     1152186   SEARCH? OR SEEK? OR FIND? OR QUER? OR RETRIEV? OR LOCAT?
S3     1006393   AGENT? OR IA OR SPIDER? OR CRAWLER? OR WEBCRAWLER? OR BOT
                 OR (SOFTWARE?)()(ROBOT?) OR SOFTBOT? OR BOTS
S4      283095   FILE? OR DOCUMENT? OR DATAFILE? OR ELECTRONIC()TEXT? OR
                 ETEXT? OR PAGE?
S5     1033005   RELEVAN? OR RANK? OR WEIGH? OR SCORE? OR POINTS
S6     2515527   SIMILAR? OR SAME? OR CONGRUENT? OR IDENTICAL? OR
                 CHARACTERISTIC? OR FEATUR?
S7          21   S1 AND S2 AND S3 AND S4
S8           8   S7 AND (S5 OR S6)
S9           9   S7 AND IC=(G06F-015? OR G06F-007?)
S10         14   S8 OR S9
S11         14   IDPAT (sorted in duplicate/non-duplicate order)
S12         13   IDPAT (primary/non-duplicate records only)
S13         70   S2 AND S4 AND S5 AND S6 AND S1
S14         69   S13 NOT (S7 OR S8 OR S9)
S15          6   S14 AND IC=G06F-007?
S16         69   S14 AND IC=G06F?
S17         55   S16 NOT AD>20010116
S18      13928   S4(5N)(S5 OR S6)
S19         33   S17 AND S18
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A] 



ABSTRACT

PROBLEM TO BE SOLVED: To extract a keyword from a document without
necessity of a dictionary.

SOLUTION: A keyword extracting device includes a suffix file generating
part 22 to receive a group of documents and to generate a suffix file
to be described later from the group of documents, a suffix file
storage part 24 to store the suffix file, a punctuating part 28 to
receive an optional document to be included in the group of documents
or a document in the same field as the group of documents and to
punctuate the document at a break of a sentence such as punctuation
marks, a score calculating part 26 to properly punctuate the sentence
based on the suffix file and the sentence supplied from the punctuating
part 28 and to calculate appearance frequency a, a degree B of
concentration of appearance and weight, etc., to be described later, an
operation result storage part 30 to store an operation result, a document
separating part 32 to punctuate the document into candidates of the
keyword based on the operation result and a narrowing part 34 to narrow
down the candidates of the keyword.
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ABSTRACT 

PROBLEM TO BE SOLVED: To attain retrieval of similar documents and
also extraction of relative keywords with high accuracy and robustness by
analyzing independently the weighted main components at both document
and keyword sides in accordance with the keyword appearance frequency
and obtaining a feature vector.

SOLUTION: Three types of data are produced on keyword appearance
frequency 103, document length 105 and keyword weight 107
respectively. Then profile vectors 111 and 109 of documents and keywords
are calculated and the weighted main component analyses 112 and 114 are
carried out independently of each other in consideration of the length 105
and keyword weight 107 for obtaining the feature vectors of each
document and keyword. Then a document and a keyword having high
similarity to the feature vector that is calculated from the retrieval
/extraction condition are obtained and displayed.
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ABSTRACT 

PROBLEM TO BE SOLVED: To automatically extract a relative keyword which
is matched with the characteristics of a document to be practically
retrieved and which is capable of obtaining one or more retrieval
results at the time of executing retrieval using the keyword.

SOLUTION: An automatic extraction device for relative keywords is
provided with a document set selection part 19 for specifying a partial
set of each document based on the attribute information, input retrieval
expression, etc., of the document, a word statistic information
management part 17 for managing the statistic information of respective
words in the whole objective document 11 and words appearing in each
document as well as their statistic information 15; and a word ranking
part 18 for calculating the importance of each word appearing in a partial
set of a certain document and for aligning respective words in the order
of importance, wherein the management part 17 quickly finds out the
statistic information of respective words in the whole document and a
specified partial set of the document. Consequently, words appearing in a
certain document set can be ranked based on their importance and a part
of the ranked words can be presented as a relative keyword.
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ABSTRACT 

PURPOSE: To provide a document retrieval device capable of obtaining an
absolutely sufficient retrieved result with the retrieval of one time
by retrieving similar documents from a document data base with the
document itself as a retrieval key.

CONSTITUTION: This document retrieval device is constituted of a
retrieval key word set generation means 2 for analyzing an input
document 1 and generating a retrieval key word set 3 for which
weighting corresponding to document component elements is performed and
a document retrieval means for retrieving the document data base
based on the retrieval key word set 3, calculating the weight of
respective matched key words for each document obtained as a result
and obtaining cumulative weight for the document of the retrieved
result. Since the cumulative weight indicating the degree of similarity
with the input document is added to the retrieved result, a user can
efficiently select the retrieved result by referring to it.
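
The cumulative-weight scoring described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the patented device: the key word weights are taken as given (the abstract derives them from document component elements of the input document), the data base is modeled as a plain dictionary, and all names are hypothetical.

```python
def rank_by_cumulative_weight(key_word_weights, database):
    """Rank documents by the summed weight of the key words they match.

    key_word_weights: {key word: weight}, assumed to have been generated
                      from the input document beforehand.
    database:         {document id: text}, a stand-in for the data base.
    """
    results = []
    for doc_id, text in database.items():
        words = set(text.lower().split())
        # Cumulative weight: add the weight of every matched key word.
        cumulative = sum(w for kw, w in key_word_weights.items() if kw in words)
        if cumulative > 0:
            results.append((doc_id, cumulative))
    # Highest cumulative weight (greatest similarity to the input) first.
    results.sort(key=lambda item: item[1], reverse=True)
    return results
```

Returning the cumulative weight alongside each document identifier mirrors the abstract's point that the user can consult it when selecting among the retrieved results.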
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Keyfact-based text search system, apparatus and method for searching 

keyfact-based text therewith

Patent Assignee: KOREA ELECTRONICS & TELECOM RES INST (KOEL-N); ELECTRONICS
& TELECOM RES INST (ELTE-N)
Inventor: JANG M G; JUN M S; JUNG G T; PARK S Y; CHONG K T; JANG M; JUN M
Number of Countries: 002 Number of Patents: 002
Patent Family:
Patent No      Kind  Date      Applicat No  Kind  Date      Week
KR 2001004404  A     20010115  KR 9925035   A     19990628  200150 B
US 6366908     B1    20020402  US 99475743  A     19991230  200226

Priority Applications (No Type Date): KR 9925035 A 19990628
Patent Details:
Patent No      Kind  Lan  Pg  Main IPC     Filing Notes
KR 2001004404  A          1   G06F-017/30
US 6366908     B1             G06F-017/30

Abstract (Basic) : KR 2001004404 A 

NOVELTY - A keyfact-based text search system is provided to
display concepts of a document with a couple of an object and
property and to index and search text based on the couple-displayed
data.

DETAILED DESCRIPTION - In a keyfact-based text search system, a
keyfact sampling device (11) samples key facts from plural key words
having the improved vagueness in speech by analyzing a document group
to be searched and question of a user. A keyfact index device (12)
saves a keyfact list of the entire document groups in a search
structure of keyfact as well as calculates frequency of various
keyfacts in the document group to be searched. A keyfact search
device (13) receives the key facts about the question of the user and
the other ones of the document group. The keyfact search device
defines a keyfact-based search model and outputs the similar
document to the question by considering a weighting constant
depending on the type of keyfacts.
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Document screening based on similarity of contents intrinsic to 

search document, determines frequency of keywords in each of
screened documents preclassified subjectwise
Patent Assignee: TOSHIBA COMPUTER ENG KK (TOSH-N); TOSHIBA KK (TOKE)
Number of Countries: 001 Number of Patents: 001
Patent Family:
Patent No      Kind  Date      Applicat No  Kind  Date      Week
JP 2001155020  A     20010608  JP 99334597  A     19991125  200148 B

Priority Applications (No Type Date): JP 99334597 A 19991125
Patent Details:
Patent No      Kind  Lan  Pg  Main IPC     Filing Notes
JP 2001155020  A          14  G06F-017/30

Abstract (Basic) : JP 2001155020 A 

NOVELTY - From among a host of documents preclassified by subject
and held in a database, those whose contents are adequately similar
to those of the search document are isolated. The frequency of
keywords is determined in the documents to be screened, with
frequencies adjusted for the size of the documents. A graded weight
is assigned to keywords based on the determined frequency and serves
as the basis for document selection.
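
A size-adjusted keyword frequency with a graded weight might look like the sketch below. The abstract does not state how the adjustment or the grading is done, so dividing by document length and the two thresholds are assumptions chosen only for illustration; the function name is hypothetical.

```python
def graded_keyword_weights(keywords, document_text, grade_thresholds=(0.01, 0.05)):
    """Assign each keyword a grade from its length-adjusted frequency.

    Assumptions: frequency is adjusted by dividing by the number of
    tokens in the document, and the grade is the number of thresholds
    (hypothetical values) that the adjusted frequency reaches.
    """
    tokens = document_text.lower().split()
    length = len(tokens) or 1  # avoid division by zero for empty documents
    grades = {}
    for kw in keywords:
        rate = tokens.count(kw) / length  # frequency adjusted for size
        grades[kw] = sum(rate >= t for t in grade_thresholds)
    return grades
```

A selection step could then keep the documents whose keyword grades add up to the highest totals.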

DETAILED DESCRIPTION - INDEPENDENT CLAIMS are also included for the 
following : 

(a) Similar document search procedure;

(b) Recording medium 

USE - Databases holding vast number of documents need to be
scanned to locate subject specific information available with select
document.

ADVANTAGE - Leads to higher accuracy in the location of
documents that have contents bearing adequate similarity to those
obtaining with current search document.

DESCRIPTION OF DRAWING (S) - The figure shows the block diagram of 
components of similar document search device. (Drawing includes 
non-English language text) . 
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extracted terms according to generated moduli and accepting terms with 
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Abstract (Basic) : WO 200033215 Al 

NOVELTY - Occurrences of each term extracted from a document are
counted to establish a frequency value for each term. The characters in
each term are counted. The frequency value for each term, or a monotonic
function of it, is multiplied by the character count, or a monotonic
function of it, to form a modulus for each term. The terms are sorted
according to the moduli, and terms with greater moduli are accepted as
characteristic keywords of the document's content.
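
The modulus computation can be sketched directly from this description. The sketch uses the plain frequency and character count (the identity function as the monotonic function), splits on whitespace, and keeps a fixed number of top terms; the cut-off and the names are assumptions, since the abstract does not say how many terms are accepted.

```python
from collections import Counter

def characteristic_keywords(text, top_n=3):
    """Rank terms by modulus = frequency * character count.

    Assumptions: terms are lower-cased whitespace tokens and the
    top_n highest-modulus terms are accepted (hypothetical cut-off).
    Ties are broken alphabetically to keep the output deterministic.
    """
    terms = text.lower().split()
    frequency = Counter(terms)
    modulus = {term: count * len(term) for term, count in frequency.items()}
    ranked = sorted(modulus.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]
```

Because only the document itself is read, the computation needs no collection-wide statistics, which matches the scalability advantage claimed in the abstract.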

USE - In computer, world wide web for term weighting, for
information retrieval applications such as document retrieval,
cross-language information retrieval, keyword extraction, document
routing, classification, categorization, clustering, document
filtering, query expansion, chapter, paragraph and sentence
segmentation, spelling correction, term, query and document
similarity metrics and text summarization.

ADVANTAGE - Size of indexes in the information retrieval
algorithm is reduced. The method is easy to implement and
use and requires less memory. The method is scalable because it
does not rely on information outside the document and so does not
consume more resources as the number of documents increases. So the
method is highly suitable for distributed information retrieval
applications.

DESCRIPTION OF DRAWING (S) - The figure shows the flow diagram 
explaining the computer program for implementing the characterizing 
terms extraction method. 

pp; 16 DwgNo 1/1 

Title Terms: TERM; EXTRACT; METHOD; COMPUTER; SORT; EXTRACT; TERM; ACCORD; 
GENERATE; MODULUS; ACCEPT; TERM; GREATER; MODULUS; CHARACTERISTIC ; 
KEYWORD ; DOCUMENT ; CONTENT 






Derwent Class: T01

International Patent Class (Main) : G06F-017/30 

File Segment: EPI 



Set      Items   Description
S1        9991   (INDEX OR THESAURUS OR KEY)()(WORD? OR TERM? OR PHRASE?) OR
                 KEYWORD? OR KEYTERM?
S2     1152186   SEARCH? OR SEEK? OR FIND? OR QUER? OR RETRIEV? OR LOCAT?
S3     1006393   AGENT? OR IA OR SPIDER? OR CRAWLER? OR WEBCRAWLER? OR BOT
                 OR (SOFTWARE?)()(ROBOT?) OR SOFTBOT? OR BOTS
S4      283095   FILE? OR DOCUMENT? OR DATAFILE? OR ELECTRONIC()TEXT? OR
                 ETEXT? OR PAGE?
S5     1033005   RELEVAN? OR RANK? OR WEIGH? OR SCORE? OR POINTS
S6     2515527   SIMILAR? OR SAME? OR CONGRUENT? OR IDENTICAL? OR
                 CHARACTERISTIC? OR FEATUR?
S7          21   S1 AND S2 AND S3 AND S4
S8           8   S7 AND (S5 OR S6)
S9           9   S7 AND IC=(G06F-015? OR G06F-007?)
S10         14   S8 OR S9
S11         14   IDPAT (sorted in duplicate/non-duplicate order)
S12         13   IDPAT (primary/non-duplicate records only)
? show files
File 347:JAPIO Oct 1976-2003/Sep (Updated 040105)
         (c) 2004 JPO & JAPIO
File 350:Derwent WPIX 1963-2004/UD,UM &UP=200408
         (c) 2004 Thomson Derwent



12/b/l (Item 1 from file: 350) 

DIALOG (R) File 350:Derwent WPIX 

(c) 2004 Thomson Derwent . All rts. reserv. 



015770961 **Image available** 

WPI Acc No: 2003-833163/200377 

Related WPI Acc No: 2002-698395 

XRPX Acc No: N03-666117 

Web search engine e.g. for Google search engine ranks hyper text
pages that are stored along with keyword in index database, by
determining intrinsic and extrinsic ranks according to web page
content and connectivity analysis
Patent Assignee: CHUNG S (CHUN-I); DOD A (DODA-I); KIM B S (KIMB-I); KIM M
(KIMM-I); YUN Y (YUNY-I)
Inventor: CHUNG S; DOD A; KIM B S; KIM M; YUN Y
Number of Countries: 001 Number of Patents: 001
Patent Family:
Patent No       Kind  Date      Applicat No    Kind  Date      Week
US 20030208482  A1    20031106  US 2001757435  A     20010110  200377
                                US 2003454452  A     20030603
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Patent Details: 
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US 20030208482 Al 15 G06F-007/00 Div ex application US 2001757435 

Abstract (Basic) : US 20030208482 A1

NOVELTY - A crawler (12) fetches pages from web (13) and stores them
in a database (14). A URL management system (18) assigns an identification
number to the URL of each page. An indexer (26) parses keywords from
the pages and stores the URL along with keywords in index database
(28). A ranking unit (30) ranks the hypertext pages based on
intrinsic or extrinsic rank provided to the page according to the
content and connectivity analysis.

DETAILED DESCRIPTION - An INDEPENDENT CLAIM is also included for
computer system.

USE - Web search engine such as Google, FAST, Altavista, Excite,
Yahoo, HotBot, Infoseek and Northern light search engine.

ADVANTAGE - The most relevant pages of the search result are
provided to the user, by assigning ranks to the hypertext pages for
multi-keyword query.

DESCRIPTION OF DRAWING (S) - The figure shows the architecture of
the search engine.
crawler (12)

web (13)

URL management system (18)

indexer (26)

indexed database (28)

ranker (30)
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Abstract (Basic) : US 20030208472 Al 

NOVELTY - The hot keys mapped to several uniform resource locators
(URLs), are applied to a selected keyword that is displayed on a
document viewer. The keys trigger an access agent to retrieve URL
associated with the hot key and the user selected keyword. The agent
replaces the marker in the URL with the selected keyword, and
invokes web browser on the user system, by passing URL as command
argument.
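
The marker substitution described here can be sketched as follows. The marker string, the hot-key table, the example URLs, and the browser command name are all hypothetical; the abstract does not specify any of them. The sketch only builds the command line and does not launch anything.

```python
from urllib.parse import quote_plus

# Hypothetical marker and hot-key table; the abstract names neither.
MARKER = "{keyword}"
HOT_KEYS = {
    "F9": "https://example.com/lookup?q={keyword}",
    "F10": "https://example.org/define/{keyword}",
}

def build_browser_command(hot_key, selected_keyword, browser="browser"):
    """Return the argument list that would start the browser on the URL.

    The URL mapped to the hot key has its marker replaced by the
    URL-encoded keyword; the result is passed as the command argument.
    """
    template = HOT_KEYS[hot_key]
    url = template.replace(MARKER, quote_plus(selected_keyword))
    return [browser, url]
```

Encoding the keyword before substitution keeps spaces and punctuation in the selected text from breaking the URL.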

DETAILED DESCRIPTION - INDEPENDENT CLAIMS are also included for the 
following : 

(1) method for hyperlinking hypertext in web page ; and 

(2) document keywords linking system. 

USE - For transparently linking keywords of document displayed
in document viewer of user computer, to web sites offering keyword
based information look-up services.

ADVANTAGE - The user's interaction in retrieving desired
information on world wide web associated with keyword displayed in
viewer, is optimized effectively. Helps Internet users to search
information from several information sources easily, and the lost link
problem is eliminated effectively.

DESCRIPTION OF DRAWING (S) - The figure shows a schematic view of 
the end user system connected to web site through network. 

Internet (20) 

web site (22) 

end user computer (30) 
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Abstract (Basic) : WO 200327907 Al 

NOVELTY - A user is enabled to identify interested individuals and
electronically mail (62) relevant objects of interest to selected
individuals or to add objects to an action list.

DETAILED DESCRIPTION - The method involves importing or creating
individual profiles with associated key-phrases, and querying all
sources of information on a network (48) based on user-based or
individual-based key-phrases with relevant objects e.g.
documents, retrieved based on key-phrase occurrence and based on
association to individuals who have matching key-phrases. This
allows a user to easily identify interested individuals and
electronically mail (62) relevant objects of interest to selected
individuals or to add objects to an action list.

INDEPENDENT CLAIMS are included for: a system for providing
information that is of interest to a group of individuals associated
with the user; a method for identifying in a group individuals at least
one that has an interest in information that a user possesses.

USE - Providing a user with information that is of interest to a
group of individuals associated with the user; a system for providing a
user with information that is of interest to a group of individuals
associated with the user. Exchange of timely and relevant
communication e.g. between: brokers, agents, sales professionals,
stock brokers, financial advisers, real estate agents, travel agents,
insurance agents etc.

ADVANTAGE - Provides user e.g. broker or agent with solution to
improving and enabling communication with their clients using
algorithms which cross-reference the interests of clients with any
information pool and present a list of clients who are interested in
the relevant information.

DESCRIPTION OF DRAWING (S) - The drawing shows a schematic diagram
illustrating the general method of the invention.

User broker (40)

User interface (42)

Application database (44)

Application search agent (46)

Information provider (52,54)
Search engines (56)

Individual client (58)

Web browser (60)

Email (62)
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Abstract (Basic) : WO 200142906 Al 

NOVELTY - The input text expression is parsed to identify keyword,
a connector that includes grammatical rules is determined based on
the keyword and a template in the connector is filled based on the
input text expression. An agent identified by the connector is then
launched and the response received as streaming HTML data.

DETAILED DESCRIPTION - An INDEPENDENT claim is also included for 

USE - For hand-held device. 

ADVANTAGE - It allows the hand-held device to access/retrieve
information from the Internet without requiring HTML pages to be
processed remotely using a proxy server that requires maintenance.

DESCRIPTION OF DRAWING (S) - The figure shows flow chart of
information access/retrieval method.
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Abstract (Basic) : EP 1069515 Al 

NOVELTY - Web pages in hyper space (10) may carry information
relevant to a keyword given by a web user and a web information
extractor provides the server with an extracted result (20) in the form
of a single user file obtained according to three main functions,
i.e. hyper-link traversal for finding desired information pages,
searching and collecting them into a user file. The functions are
performed according to a target uniform resource locator provided
from a search engine.

DETAILED DESCRIPTION - AN INDEPENDENT CLAIM is included for a 
method for web information extraction in an intelligent agent system. 

USE - Web information extraction. 

ADVANTAGE - Automatic hyper-link space searching and quick
content collection.

DESCRIPTION OF DRAWING (S) - The drawing is a schematic diagram of
the overall concept of a web information extractor

Hyper space (10)

Extracted result (20)
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Abstract (Basic) : GB 2335761 A 

NOVELTY - The search system uses a web agent (126) to manage a
portfolio of web page profiles for users (118, 120) stored in the
server (112) memory (116). The user profile contains background
information on the user with information about the user's interest in
particular features and web pages.

DETAILED DESCRIPTION - The user profile can be generated by either
manual entry or a learning process based on which sites the user visits
and length of time spent at that site.

When carrying out a search for keywords the user profile is
first used to select pages of interest to the user then these pages
are further filtered by removing those without the keyword before the
results are displayed.
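
The two-stage filtering described above (profile first, keyword second) can be sketched as below. The page records, the representation of the profile as a set of topic labels, and the substring test for the keyword are assumptions made for illustration; the abstract does not describe the data structures.

```python
def profile_keyword_search(pages, profile_interests, keyword):
    """Select pages matching the user profile, then drop those lacking the keyword.

    pages:             list of dicts with hypothetical fields
                       'url', 'topics' (list of labels) and 'text'.
    profile_interests: set of topic labels taken from the user profile.
    """
    # Stage 1: keep pages whose topics overlap the user's interests.
    of_interest = [p for p in pages if profile_interests & set(p["topics"])]
    # Stage 2: remove pages that do not contain the keyword.
    return [p["url"] for p in of_interest if keyword.lower() in p["text"].lower()]
```

Applying the profile before the keyword test is what narrows the result list compared with a keyword search alone.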

USE - For use as a search engine for the world wide web. 

ADVANTAGE - The addition of the user profile to the search engine 
gives a more specific list of web sites that the user may be interested 
in than a keyword search alone would provide, lowering the amount 
of sites returned by the search engine. 

DESCRIPTION OF DRAWING (S) - Block diagram of a network system 
implementing the user profile based search . 

Web server (112) 

Web server memory element (116) 

Web browser (user) (118, 120) 

Web agent for search engine (126)
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Abstract (Basic) : WO 9623265 A 

The system for accessing information stored in a distributed
information database uses a community of intelligent software agents
(105), each of which can be built as an extension of a known viewer
(400) for a distribution information system e.g. Internet World Wide
Web. The agent is integrated with the viewer (400) and can extract
pages by means of the viewer (400) for storage in an intelligent page
store.

The text from the information system is abstracted and stored with
additional information, selected by the user. The agent based system
uses keyword sets to locate information of interest to the user,
together with user profiles, such that pages being stored by one user
can be notified to another user whose profile indicates potential
interest.

USE - Locating information on e.g. Internet, HyperText documents
located on user's internal systems etc.
Dwg. 1/9
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Title: Features: Real-time adaptive feature and document learning for 

web search 

Author: Chen, Z.; Meng, X.; Fowler, R.H.; Zhu, B. 

Corporate Source: Dept. of Computer Science Univ. of Texas-Pan American 
Edinburg, TX 78539-2999, United States 

Source: Journal of the American Society for Information Science and 
Technology v 52 n 8 June 2001. p 655-665 

Publication Year: 2001 

CODEN: AISJB6 ISSN: 1532-2882 

Language: English 

Document Type: JA; (Journal Article) Treatment: A; (Applications); T;
(Theoretical)

Journal Announcement: 0108W1 

Abstract: In this article we report our research on building Features, an
intelligent web search engine that is able to perform real-time adaptive
feature (i.e., keyword) and document learning. Not only does Features
learn from the user's document relevance feedback, but it also
automatically extracts and suggests indexing keywords relevant to a
search query and learns from the user's keyword relevance feedback so
that it is able to speed up its search process and to enhance its search
performance. We design two efficient and mutual-benefiting learning
algorithms that work concurrently, one for feature learning and the other
for document learning. Features employs these algorithms together with
an internal index database and a real-time meta-searcher to perform
adaptive real-time learning to find desired documents with as little
relevance feedback from the user as possible. The architecture and
performance of Features are also discussed. 29 Refs.

Descriptors: Search engines; World Wide Web; Learning systems; Adaptive 
systems; Real time systems; Information retrieval systems; Automatic 
indexing; Learning algorithms 
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Publication Year: 1998 
CODEN: 002624 
Language: English 

Document Type: CA; (Conference Article) Treatment: T; (Theoretical) 

Abstract: The World-Wide Web is developing very fast. Currently, finding
useful information on the Web is a time-consuming process. In this paper,
we present WebMate, an agent that helps users to effectively browse and
search the Web. WebMate extends the state of the art in Web-based
information retrieval in many ways. First, it uses multiple TF-IDF
vectors to keep track of user interests in different domains. These domains
are automatically learned by WebMate. Second, WebMate uses the Trigger Pair
Model to automatically extract keywords for refining document search.
Third, during search, the user can provide multiple pages as
similarity/relevance guidance for the search. The system extracts and
combines relevant keywords from these relevant pages and uses them
for keyword refinement. Using these techniques, WebMate provides
effective browsing and searching help and also compiles and sends to
users a personal newspaper by automatically spidering news sources. We have
experimentally evaluated the performance of the system. (Author abstract)
19 Refs.
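
As an illustration of the first technique named in the abstract above
(one TF-IDF vector per interest domain), the following is a minimal
sketch and not WebMate's actual implementation. The function names
tfidf_vectors, cosine, and closest_domain are hypothetical, and the
weighting (relative term frequency times log inverse document frequency)
is one common TF-IDF variant chosen here as an assumption.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed weighting: relative term frequency * log(N / df).
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def closest_domain(page_vec, domain_vecs):
    """Index of the interest-domain vector most similar to the page vector."""
    return max(range(len(domain_vecs)), key=lambda i: cosine(page_vec, domain_vecs[i]))
```

Under these assumptions, a newly visited page would be compared against
each stored domain vector and assigned to the most similar one; how
WebMate learns and updates the domains is not specified in the abstract.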

Abstract: A system that carries out highly effective searches over
collections of textual information is presented. The system is made up of
two major parts. The first part consists of an agent, Musag, that learns
to relate concepts that are semantically 'similar' to one another. The
second part consists of another agent, Sag, which is responsible for
retrieving documents, given a set of keywords with relative weights.
The agents' system architecture, along with the nature of their
interactions, the learning and search algorithms, the notion of 'cost of
learning' and how it influences the learning process and the quality of the
dictionary at any given time are described. 8 Refs.

Abstract: The use of cited title terms of a scientific document for
automatic indexing is explored. It offers a means of index term
selection as well as term relevance weighting, based on author-provided
relevance information and Bayes' theorem as in probabilistic retrieval.
The latter quantitative consideration leads to a new document-document
similarity measure, which is shown to have importance both for
initial search and in relevance feedback retrieval, by offering a
choice of iterative strategies. Extension of the concept of cited title
terms to citing title terms shows that these two approaches are compatible
with the current two competing models of probability of relevance for
document retrieval (Robertson et al. 1982), if a document can also be
regarded as a query. Their term usage may therefore provide the necessary
statistics for parameter estimation to test both theories. (Author
abstract) 17 Refs.

A distinction is made between document retrieval systems and 'fact
retrieval' systems. It is stipulated that for the former the index
terms should be the names of the topics dealt with by the documents in the
system collection. Such index terms are called 'document
characteristics.' A document is then regarded as a complex assertion,
and the problem of discovering its characteristics is defined to be that of
isolating the referring expressions in the components of the complex
assertion. It is shown that the type of reference discernable in simple
sentences is preserved when such sentences are transformationally combined
to produce complex sentences. Two methods of sentence reduction are
examined for this purpose, viz., the derivation of microsentences and a
kernelization program. Kernels are inefficient for document
characterization purposes. Hence, an algorithm is constructed which
operates on kernels to form certain microsentences called "assertive
components". This algorithm, together with a method for weighting the
referring expressions of assertive components, provides the means for
assigning characteristics to any given document. The characteristics
accurately denote the topics about which assertions have been made in the
document, and the weighting of the characteristics supplies a means for
assessing how much of the document's content is taken up with a discussion
of those topics.

Classification Codes and Description: 4.07 (Classification, Indexing, and 

Abstract: The authors propose a PWM (Personal Web Map) system which
gathers information through the WWW depending on a user's interests. The
PWM is a database of relevant Web pages constructed under his/her
interactive control. For a user's easy understanding of the gathering
process, the system controls Web robots to keep PWM as uniform as possible
on keywords. The gathering process is indicated using a 2D map generated
by SOM (self-organizing map), and a user gives the system feedback through
it. Finally, we conducted various experiments, and proved that a PWM system
was promising for information gathering in the WWW. (13 Refs)

ABSTRACT

PROBLEM TO BE SOLVED: To provide an information searching device, an
information searching method and a recording medium by which only
necessary contents among contents following from a prescribed start page
can be acquired by giving a query such as "I wish to acquire a homepage
including one or more images, using 'KYOTO' and 'SIGHTSEEING' as keywords
and described in 'Japanese'" to a searching robot.

SOLUTION: Only the necessary contents are acquired by checking whether or
not the contents are compatible with the input query.
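
The compatibility check described in the SOLUTION above can be pictured
with a short sketch. This is only an illustration under stated
assumptions, not the patented method: the Page and Query types, their
field names, and the matches function are all hypothetical, and keyword
matching is simplified to case-insensitive substring search.

```python
from dataclasses import dataclass


@dataclass
class Page:
    # Hypothetical representation of a fetched page.
    text: str
    language: str
    image_count: int


@dataclass
class Query:
    # Hypothetical structured form of the example query in the abstract.
    keywords: list
    language: str
    min_images: int = 0


def matches(page, query):
    """Return True only if the page satisfies every condition of the query."""
    text = page.text.lower()
    return (
        page.language == query.language
        and page.image_count >= query.min_images
        and all(keyword.lower() in text for keyword in query.keywords)
    )


# The example query from the abstract: keywords KYOTO and SIGHTSEEING,
# written in Japanese, with one or more images.
example_query = Query(keywords=["KYOTO", "SIGHTSEEING"], language="Japanese", min_images=1)
```

A searching robot following links from the start page would, under this
sketch, keep only the pages for which matches returns True.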